adding CombineInputFileFormat; only single use case so far #3
thomasstorm wants to merge 2 commits into master
Conversation
    File inputFile = getInputFile();
    if (inputFile == null) {
        for (Path inputPath : inputPaths) {
            inputFile = CalvalusProductIO.copyFileToLocal(inputPath, getConfiguration());
            setInputFile(inputFile);
            if (inputFile == null) {
                setInputFile(inputFile);
            }
        }
    }
Here, it seems the logic has unintentionally changed. Previously, the null test was located before the second copyFileToLocal assignment; with this change it will never be true.
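For illustration, a minimal sketch of the ordering difference being described; the names mirror the snippet above, but this is not the original Calvalus code, and the behaviour of copyFileToLocal is only assumed:

    // Earlier ordering as described: test first, copy only when the input file is missing.
    File inputFile = getInputFile();
    if (inputFile == null) {
        inputFile = CalvalusProductIO.copyFileToLocal(inputPath, getConfiguration());
        setInputFile(inputFile);
    }

    // Ordering in the new snippet: the copy and setInputFile have already run, so the
    // null test that follows can only re-set the same value and (assuming
    // copyFileToLocal returns a non-null File) is never true.
    inputFile = CalvalusProductIO.copyFileToLocal(inputPath, getConfiguration());
    setInputFile(inputFile);
    if (inputFile == null) {
        setInputFile(inputFile);
    }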
    /**
     * @author thomas
     */
    public class CombineFileInputFormat extends InputFormat {
Is there a relation to org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat?
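For context only (not Calvalus code): Hadoop's built-in class of that name is normally subclassed to pack many small input files into CombineFileSplits, roughly as in the sketch below. WholeFileRecordReader is a hypothetical per-file reader, and the 128 MB cap is an arbitrary example value.

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader;
    import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;

    /** Packs many small files into combined splits; contrast with creating one split per pattern. */
    public class SmallFilesInputFormat extends CombineFileInputFormat<LongWritable, Text> {

        public SmallFilesInputFormat() {
            setMaxSplitSize(128L * 1024 * 1024);  // each combined split is capped at ~128 MB
        }

        @Override
        public RecordReader<LongWritable, Text> createRecordReader(InputSplit split, TaskAttemptContext context)
                throws IOException {
            // WholeFileRecordReader (hypothetical) reads one of the packed files per invocation.
            return new CombineFileRecordReader<>((CombineFileSplit) split, context, WholeFileRecordReader.class);
        }
    }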
     * Creates a single split from a given pattern
     */
    @Override
    public List<InputSplit> getSplits(JobContext context) throws IOException {
What about our other methods for determining inputs, in particular those using the geo-inventory? I know that PatternBasedInputFormat needs refactoring and decomposition, but I think the other ways of determining inputs are still required.
Thinking about how to refactor PatternBasedInputFormat, it may be good to distinguish the way the inputs are determined (geo-inventory, opensearch query, path pattern, ...) by different classes, as they have different parameters anyway; the client could then automatically select the right class depending on which parameter is specified. From each of these we could either derive a class for CombineFileSplit generation, or make it a parameter. In any case, the old PatternBasedInputFormat could delegate the getSplits() call to the new implementations to keep backwards compatibility.
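A minimal sketch of that delegation idea, with purely hypothetical names (InputSelection, selectStrategy, the calvalus.input.combine flag); none of this is existing Calvalus code:

    import java.io.IOException;
    import java.util.List;
    import org.apache.hadoop.mapreduce.InputFormat;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.JobContext;

    /** One strategy per way of determining inputs: geo-inventory, opensearch query, path pattern, ... */
    interface InputSelection {
        List<InputSplit> createSplits(JobContext context, boolean combineIntoSingleSplit) throws IOException;
    }

    /** The existing entry point keeps its name and delegates, preserving backwards compatibility. */
    public abstract class PatternBasedInputFormat extends InputFormat {

        @Override
        public List<InputSplit> getSplits(JobContext context) throws IOException {
            // Pick the strategy from whichever request parameter is present.
            InputSelection selection = selectStrategy(context);
            // Hypothetical flag deciding whether to emit one CombineFileSplit or many splits.
            boolean combine = context.getConfiguration().getBoolean("calvalus.input.combine", false);
            return selection.createSplits(context, combine);
        }

        /** Returns e.g. a GeoInventoryInputSelection, OpenSearchInputSelection, or PathPatternInputSelection. */
        protected abstract InputSelection selectStrategy(JobContext context);
    }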